AIML_ML_Project_full_code_notebook¶

New notebook

Machine Learning: AllLife Bank Personal Loan Campaign¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [ ]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [3]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)

import warnings
warnings.filterwarnings("ignore")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 11, Finished, Available)

Observation :¶

The code executed in cell 3 attempted to install specific versions of libraries using pip

Loading the dataset¶

In [4]:
Loan = pd.read_csv("Files/Loan_Modelling.csv") 
display(Loan)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 12, Finished, Available)
SynapseWidget(Synapse.DataFrame, a7a06ff7-8493-4a88-90c5-7592d557c107)

Observation:¶

The CSV file "Loan_Modelling.csv" was successfully read into a Pandas DataFrame.

The DataFrame Loan contains the following columns: ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard

In [5]:
# copying data to another variable to avoid any changes to original data
data = Loan.copy()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 13, Finished, Available)

Data Overview¶

  • Observations
  • Sanity checks

View the first and last 5 rows of the dataset.¶

In [6]:
data.head()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 14, Finished, Available)
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [7]:
data.tail()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 16, Finished, Available)
Out[7]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

Understand the shape of the dataset.¶

In [8]:
# Display the shape of the dataset
print("Shape of the Loan_Modelling dataset:")
print(data.shape)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 18, Finished, Available)
Shape of the Loan_Modelling dataset:
(5000, 14)

Observation :¶

This script will load the Loan_Modelling dataset into a pandas DataFrame and then print out the shape of the dataset, which includes the number of rows and columns in the dataset.

Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

Check the data types of the columns for the dataset¶

In [10]:
# Display the data types of columns in the Loan_Modelling dataset
#print(data.dtypes)
data.info()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 22, Finished, Available)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Observation :¶

The df.info() provides a concise summary of the DataFrame df, including information about the data types, non-null values, and memory usage. It displays the column names, count of non-null values in each column, data type of each column, and the total memory usage by the DataFrame. This method is useful for quickly understanding the structure of the dataset and identifying any missing values or potential data type inconsistencies. By using df.info(), you can gain insights into the dataset's composition and prepare for further data preprocessing or analysis tasks.

Checking the Statistical Summary¶

In [11]:
# Display the statistical summary of the Loan_Modelling dataset
print(data.describe())
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 25, Finished, Available)
ID          Age   Experience       Income       ZIPCode  \
count  5000.000000  5000.000000  5000.000000  5000.000000   5000.000000   
mean   2500.500000    45.338400    20.104600    73.774200  93169.257000   
std    1443.520003    11.463166    11.467954    46.033729   1759.455086   
min       1.000000    23.000000    -3.000000     8.000000  90005.000000   
25%    1250.750000    35.000000    10.000000    39.000000  91911.000000   
50%    2500.500000    45.000000    20.000000    64.000000  93437.000000   
75%    3750.250000    55.000000    30.000000    98.000000  94608.000000   
max    5000.000000    67.000000    43.000000   224.000000  96651.000000   

            Family        CCAvg    Education     Mortgage  Personal_Loan  \
count  5000.000000  5000.000000  5000.000000  5000.000000    5000.000000   
mean      2.396400     1.937938     1.881000    56.498800       0.096000   
std       1.147663     1.747659     0.839869   101.713802       0.294621   
min       1.000000     0.000000     1.000000     0.000000       0.000000   
25%       1.000000     0.700000     1.000000     0.000000       0.000000   
50%       2.000000     1.500000     2.000000     0.000000       0.000000   
75%       3.000000     2.500000     3.000000   101.000000       0.000000   
max       4.000000    10.000000     3.000000   635.000000       1.000000   

       Securities_Account  CD_Account       Online   CreditCard  
count         5000.000000  5000.00000  5000.000000  5000.000000  
mean             0.104400     0.06040     0.596800     0.294000  
std              0.305809     0.23825     0.490589     0.455637  
min              0.000000     0.00000     0.000000     0.000000  
25%              0.000000     0.00000     0.000000     0.000000  
50%              0.000000     0.00000     1.000000     0.000000  
75%              0.000000     0.00000     1.000000     1.000000  
max              1.000000     1.00000     1.000000     1.000000

Dropping columns¶

In [12]:
# Dropping columns from Loan_Modelling dataset
data = data.drop(['ZIPCode', 'Family', 'Mortgage'], axis=1, inplace=True)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 27, Finished, Available)

Observation :¶

The purpose of this code is to remove unnecessary columns from the Loan_Modelling dataset to potentially improve model performance or simplify data analysis by focusing only on relevant features.

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Checking for Anomalous Values¶

In [18]:
data["Experience"].unique()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 39, Finished, Available)
Out[18]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, -1, 34,  0, 38, 40, 33,  4, -2, 42, -3, 43])
In [19]:
# checking for experience <0
data[data["Experience"] < 0]["Experience"].unique()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 40, Finished, Available)
Out[19]:
array([-1, -2, -3])
In [20]:
# Correcting the experience values
data["Experience"].replace(-1, 1, inplace=True)
data["Experience"].replace(-2, 2, inplace=True)
data["Experience"].replace(-3, 3, inplace=True)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 43, Finished, Available)
In [21]:
data["Education"].unique()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 44, Finished, Available)
Out[21]:
array([1, 2, 3])

Feature Engineering¶

In [22]:
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 45, Finished, Available)
Out[22]:
467
In [23]:
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]

data["ZIPCode"] = data["ZIPCode"].astype("category")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 46, Finished, Available)
Number of unique values if we take first two digits of ZIPCode:  7
In [24]:
## Converting the data type of categorical features to 'category'
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 48, Finished, Available)

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

In [25]:
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 49, Finished, Available)

Observation :¶

Overall, this summary analysis provides valuable insights into the demographic and financial characteristics of the customer base. Further analysis could focus on exploring relationships between variables such as income and credit card spending or identifying factors that influence customer decisions to take up personal loans or other financial products offered by the bank.

In [28]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 53, Finished, Available)

Observations on Age¶

In [30]:
import matplotlib.pyplot as plt

def histogram_boxplot(data, column):
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    
    # Histogram
    ax[0].hist(data[column], bins=20, color='skyblue', edgecolor='black')
    ax[0].set_title(f'Histogram of {column}')
    ax[0].set_xlabel(column)
    ax[0].set_ylabel('Frequency')
    
    # Boxplot
    ax[1].boxplot(data[column], vert=False)
    ax[1].set_title(f'Boxplot of {column}')
    ax[1].set_xlabel(column)
    
    plt.tight_layout()
    plt.show()

# Call the function with the data and column name
histogram_boxplot(data, "Age")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 56, Finished, Available)

Observations on Experience¶

In [31]:
import matplotlib.pyplot as plt

def histogram_boxplot(data, column):
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    
    # Histogram
    ax[0].hist(data[column], bins=20, color='skyblue', edgecolor='black')
    ax[0].set_title(f'Histogram of {column}')
    ax[0].set_xlabel(column)
    ax[0].set_ylabel('Frequency')
    
    # Boxplot
    ax[1].boxplot(data[column], vert=False)
    ax[1].set_title(f'Boxplot of {column}')
    ax[1].set_xlabel(column)
    
    plt.tight_layout()
    plt.show()

# Call the function with the data and column name
histogram_boxplot(data, "Experience")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 60, Finished, Available)

Observations on Income¶

In [32]:
import plotly.express as px

fig = px.histogram(data, x='Income', title='Histogram of Income', template='plotly_dark')
fig.update_layout(bargap=0.1)
fig.show()

fig = px.box(data, y='Income', title='Boxplot of Income', template='plotly_dark')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 61, Finished, Available)

Observations on CCAvg¶

In [33]:
import plotly.express as px

fig = px.histogram(data, x='CCAvg', title='Histogram of CCAvg', template='plotly_dark')
fig.update_layout(bargap=0.1)
fig.show()

fig = px.box(data, y='CCAvg', title='Boxplot of CCAvg', template='plotly_dark')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 63, Finished, Available)

Observations on Mortgage¶

In [34]:
import plotly.express as px

fig = px.histogram(data, x='Mortgage', title='Histogram of Mortgage', template='plotly_dark')
fig.update_layout(bargap=0.1)
fig.show()

fig = px.box(data, y='Mortgage', title='Boxplot of Mortgage', template='plotly_dark')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 65, Finished, Available)

Observations on Family¶

In [35]:
labeled_barplot(data, "Family", perc=True)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 66, Finished, Available)

Observations on Education¶

In [36]:
import plotly.express as px

fig = px.bar(data, x='Education', title='Barplot of Education', template='plotly_dark', color='Education')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 68, Finished, Available)

Observations on Securities_Account¶

In [37]:
import plotly.express as px

fig = px.bar(data, x='Securities_Account', title='Barplot of Securities Account', template='plotly_dark', color='Securities_Account')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 70, Finished, Available)

Observations on CD_Account¶

In [38]:
import plotly.express as px

fig = px.bar(data, x='CD_Account', title='Barplot of CD Account', template='plotly_dark', color='CD_Account')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 72, Finished, Available)

Observations on Online¶

In [39]:
import plotly.express as px

fig = px.bar(data, x='Online', title='Barplot of Online', template='plotly_dark', color='Online')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 74, Finished, Available)

Observation on CreditCard¶

In [40]:
import plotly.express as px

fig = px.bar(data, x='CreditCard', title='Barplot of CreditCard', template='plotly_dark', color='CreditCard')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 76, Finished, Available)

Observation on ZIPCode¶

In [41]:
import plotly.express as px

fig = px.bar(data, x='ZIPCode', title='Barplot of ZIPCode', template='plotly_dark', color='ZIPCode')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 78, Finished, Available)

Bivariate Analysis¶

In [42]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 79, Finished, Available)
In [43]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 80, Finished, Available)

Correlation check¶

In [44]:
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") 
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 82, Finished, Available)

Let's check how a customer's interest in purchasing a loan varies with their education¶

In [45]:
stacked_barplot(data, "Education", "Personal_Loan")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 83, Finished, Available)
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------

Personal_Loan vs Family¶

In [46]:
import plotly.express as px

fig = px.bar(data_frame=data, x='Family', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Family Size')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 85, Finished, Available)

Personal_Loan vs Securities_Account¶

In [47]:
import plotly.express as px

fig = px.bar(data_frame=data, x='Securities_Account', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Securities Account')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 87, Finished, Available)

Personal_Loan vs CD_Account¶

In [48]:
import plotly.express as px

fig = px.bar(data_frame=data, x='CD_Account', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by CD Account')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 89, Finished, Available)

Personal_Loan vs Online¶

In [49]:
import plotly.express as px

fig = px.bar(data_frame=data, x='Online', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Online Banking Usage')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 91, Finished, Available)

Personal_Loan vs CreditCard¶

In [50]:
import plotly.express as px

fig = px.bar(data_frame=data, x='CreditCard', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Credit Card Ownership')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 93, Finished, Available)

Personal_Loan vs ZIPCode¶

In [51]:
import plotly.express as px

fig = px.bar(data_frame=data, x='ZIPCode', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by ZIP Code')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 95, Finished, Available)

Let's check how a customer's interest in purchasing a loan varies with their age¶

In [52]:
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 96, Finished, Available)

Personal Loan vs Experience¶

In [54]:
import matplotlib.pyplot as plt

# Group by Experience and Personal_Loan columns to get the count of each combination
grouped = data.groupby(['Experience', 'Personal_Loan']).size().unstack()

# Plotting the stacked bar plot
grouped.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Personal Loan vs Experience')
plt.xlabel('Experience')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 99, Finished, Available)

Personal Loan vs Income¶

In [55]:
import matplotlib.pyplot as plt

# Grouped data for Personal_Loan and Income
grouped_data = {
    'Personal_Loan': [0, 1],
    'Income': [102.727, 10.9091]  # Mean values for Income corresponding to Personal_Loan 0 and 1
}

# Creating a DataFrame from the grouped data
df_grouped = pd.DataFrame(grouped_data)

# Plotting the stacked barplot for Personal Loan and Income
df_grouped.plot(kind='bar', stacked=True)
plt.title('Stacked Barplot of Personal Loan vs. Income')
plt.xlabel('Personal Loan')
plt.ylabel('Income')
plt.xticks(ticks=[0, 1], labels=['0', '1'])
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 101, Finished, Available)

Personal Loan vs CCAvg¶

In [57]:
import plotly.express as px

fig = px.bar(data, x='CCAvg', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan based on CCAvg', template='plotly_dark')
fig.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 104, Finished, Available)

Outlier Detection¶

In [60]:
Q1 = data['Income'].quantile(0.25)  # To find the 25th percentile
Q3 = data['Income'].quantile(0.75)  # To find the 75th percentile

IQR = Q3 - Q1  # Interquartile Range (75th percentile - 25th percentile)

lower = Q1 - 1.5 * IQR  # Finding lower bound for outliers
upper = Q3 + 1.5 * IQR  # Finding upper bound for outliers
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 109, Finished, Available)
In [61]:
(
    (data.select_dtypes(include=["float64", "int64"]) < lower)
    | (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 110, Finished, Available)
Out[61]:
ID            96.28
Age            0.00
Experience     0.00
Income         1.92
Family         0.00
CCAvg          0.00
Mortgage      11.26
dtype: float64

Data Preparation for Modeling¶

In [62]:
# dropping Experience as it is perfectly correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]

X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 111, Finished, Available)
In [63]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 112, Finished, Available)
Shape of Training set :  (3500, 18)
Shape of test set :  (1500, 18)
Percentage of classes in training set:
Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64

Model Building¶

Model Evaluation Criterion¶

Model Building¶

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

  • The model_performance_classification_sklearn function will be used to check the model performance of models.
  • The confusion_matrix_sklearnfunction will be used to plot confusion matrix.
In [64]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 113, Finished, Available)
In [65]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 114, Finished, Available)

Build Decision Tree Model¶

In [66]:
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 115, Finished, Available)
Out[66]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)

Checking model performance on training data¶

In [67]:
confusion_matrix_sklearn(model, X_train, y_train)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 116, Finished, Available)
In [68]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 117, Finished, Available)
Out[68]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Visualizing the Decision Tree¶

In [69]:
feature_names = list(X_train.columns)
print(feature_names)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 118, Finished, Available)
['ID', 'Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
In [70]:
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 119, Finished, Available)
In [71]:
# Text report showing the rules of a decision tree -

print(tree.export_text(model, feature_names=feature_names, show_weights=True))
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 120, Finished, Available)
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ID <= 4936.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [51.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.20
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |   |--- ID <= 1627.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- ID >  1627.00
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- ID >  4936.50
|   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Family >  3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- ZIPCode_92 <= 0.50
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_92 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- ID <= 509.50
|   |   |   |   |   |   |   |   |--- ID <= 402.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID >  402.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ID >  509.50
|   |   |   |   |   |   |   |   |--- ID <= 4541.00
|   |   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID >  4541.00
|   |   |   |   |   |   |   |   |   |--- ID <= 4725.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- ID >  4725.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- ID <= 2942.00
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID >  2942.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- ID <= 349.50
|   |   |   |   |   |   |   |   |--- ID <= 197.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID >  197.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ID >  349.50
|   |   |   |   |   |   |   |   |--- weights: [27.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.75
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  4.75
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- ID <= 2104.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- ID >  2104.00
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |--- Family >  2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- Age <= 51.00
|   |   |   |   |   |   |--- weights: [0.00, 17.00] class: 1
|   |   |   |   |   |--- Age >  51.00
|   |   |   |   |   |   |--- Age <= 53.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  53.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  57.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  59.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income >  116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [0.00, 154.00] class: 1
In [72]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 121, Finished, Available)
Imp
Income              0.298018
Family              0.257587
Education_2         0.163412
Education_3         0.147127
CCAvg               0.044768
Age                 0.029516
ID                  0.020281
CD_Account          0.017273
ZIPCode_94          0.008713
ZIPCode_93          0.004766
Mortgage            0.003236
ZIPCode_92          0.003080
CreditCard          0.002224
Online              0.000000
Securities_Account  0.000000
ZIPCode_91          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
In [73]:
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 122, Finished, Available)

Checking model performance on test data¶

In [74]:
from sklearn.metrics import confusion_matrix

# Define the function to create a confusion matrix for test data
def confusion_matrix_sklearn(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    return cm

# Example usage with test data
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

confusion_matrix_result = confusion_matrix_sklearn(y_true, y_pred)
print(confusion_matrix_result)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 124, Finished, Available)
[[2 0]
 [1 3]]
In [ ]:
# Assuming decision_tree_perf_train is a trained decision tree model available in the environment
decision_tree_model = decision_tree_perf_train

# Check performance on test data with the defined decision_tree_model
decision_tree_perf_test = model_performance_classification_sklearn(X_test, y_test, decision_tree_model)

Model Performance Improvement¶

Pre-Pruning¶

In [81]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(6, 15),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 137, Finished, Available)
Out[81]:
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, min_samples_leaf=10,
                       random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, min_samples_leaf=10,
                       random_state=1)

Checking performance on training data

In [84]:
from sklearn.metrics import confusion_matrix

# Define the function to create confusion matrix
def confusion_matrix_sklearn(y_actual, y_predicted):
    return confusion_matrix(y_actual, y_predicted)

# Create confusion matrix for train data
y_train_predicted = model.predict(X_train)
confusion_matrix_train = confusion_matrix_sklearn(y_actual, y_train_predicted)
confusion_matrix_train
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 143, Finished, Available)
Out[84]:
array([[2230,  229],
       [ 939,  102]])
In [85]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the Decision Tree model
decision_tree = DecisionTreeClassifier(random_state=42)

# Define the parameter grid to search through
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform Grid Search Cross Validation
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_actual)

# Get the best parameters and best score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Train a new model with the best parameters found
best_decision_tree_model = DecisionTreeClassifier(random_state=42, **best_params)
best_decision_tree_model.fit(X_train, y_actual)

# Evaluate performance on train data
decision_tree_tune_perf_train = model_performance_classification_sklearn(best_decision_tree_model, X_train, y_actual)
decision_tree_tune_perf_train
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 145, Finished, Available)
Out[85]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Visualizing the Decision Tree

In [86]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 146, Finished, Available)
In [87]:
# Text report showing the rules of a decision tree -

print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 147, Finished, Available)
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- weights: [79.00, 10.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- weights: [117.00, 15.00] class: 0
|   |   |--- Income >  92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- weights: [37.00, 14.00] class: 0
|   |   |   |--- Family >  2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- weights: [1.00, 20.00] class: 1
|   |   |   |   |--- Age >  57.50
|   |   |   |   |   |--- weights: [7.00, 3.00] class: 0
|--- Income >  116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [0.00, 154.00] class: 1
In [88]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 148, Finished, Available)
Imp
Income              0.337681
Family              0.275581
Education_2         0.175687
Education_3         0.157286
CCAvg               0.042856
Age                 0.010908
ZIPCode_92          0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
ZIPCode_94          0.000000
ZIPCode_93          0.000000
ID                  0.000000
ZIPCode_91          0.000000
Online              0.000000
CD_Account          0.000000
Securities_Account  0.000000
Mortgage            0.000000
CreditCard          0.000000
In [89]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 149, Finished, Available)

Checking performance on test data

In [ ]:
from sklearn.metrics import confusion_matrix

# Assuming X_test and y_test are the test data and corresponding true labels
y_pred = decision_tree_tune.predict(X_test)
confusion_matrix_sklearn = confusion_matrix(y_test, y_pred)
print(confusion_matrix_sklearn)
In [ ]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(X_test, y_test, decision_tree_tune)
decision_tree_tune_perf_test

Cost-Complexity Pruning¶

In [94]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 157, Finished, Available)
In [95]:
pd.DataFrame(path)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 158, Finished, Available)
Out[95]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000223 0.001114
2 0.000250 0.001614
3 0.000268 0.002688
4 0.000272 0.003232
5 0.000273 0.004868
6 0.000276 0.005420
7 0.000381 0.005801
8 0.000527 0.006329
9 0.000625 0.006954
10 0.000700 0.007654
11 0.000769 0.010731
12 0.000882 0.014260
13 0.000889 0.015149
14 0.001026 0.017200
15 0.001305 0.018505
16 0.001647 0.020153
17 0.002333 0.022486
18 0.002407 0.024893
19 0.003294 0.028187
20 0.006473 0.034659
21 0.025146 0.084951
22 0.039216 0.124167
23 0.047088 0.171255
In [96]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 159, Finished, Available)

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [97]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)  ## Complete the code to fit decision tree on training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 161, Finished, Available)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596766
In [98]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 162, Finished, Available)

Recall vs alpha for training and testing sets

In [99]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 163, Finished, Available)
In [100]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 164, Finished, Available)
In [101]:
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 165, Finished, Available)
DecisionTreeClassifier(ccp_alpha=0.00027210884353741507, random_state=1)

Post-Purning¶

In [102]:
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=best_model.ccp_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 167, Finished, Available)
Out[102]:
DecisionTreeClassifier(ccp_alpha=0.00027210884353741507,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.00027210884353741507,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)

Checking performance on training data

In [104]:
from sklearn.metrics import confusion_matrix

# Define the function to create a confusion matrix using sklearn
def confusion_matrix_sklearn(y_actual, y_predicted):
    return confusion_matrix(y_actual, y_predicted)

# Create the confusion matrix for the training data
y_train_predicted = model.predict(X_train)
confusion_matrix_train = confusion_matrix_sklearn(y_actual, y_train_predicted)
confusion_matrix_train
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 171, Finished, Available)
Out[104]:
array([[2230,  229],
       [ 939,  102]])
In [ ]:
decision_tree_tune_post_train = model_performance_classification_sklearn(X_train, y_actual)
decision_tree_tune_post_train

Visualizing the Decision Tree

In [106]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 174, Finished, Available)
In [107]:
# Text report showing the rules of a decision tree -

print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 175, Finished, Available)
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  3.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |--- weights: [6.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_91 >  0.50
|   |   |   |   |   |   |   |--- ID <= 2043.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- ID >  2043.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- Income >  81.50
|   |   |   |   |   |--- ID <= 934.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- ID >  934.50
|   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |--- Mortgage <= 162.00
|   |   |   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |   |   |--- ID <= 3334.00
|   |   |   |   |   |   |   |   |   |   |--- ID <= 1748.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- ID >  1748.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ID >  3334.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.95] class: 1
|   |   |   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- Mortgage >  162.00
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- ID <= 766.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- ID >  766.50
|   |   |   |   |--- weights: [0.00, 6.80] class: 1
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  4.20
|   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |--- Income >  100.00
|   |   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_91 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |--- Income >  103.50
|   |   |   |   |   |   |--- weights: [64.95, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |   |--- Income >  110.00
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- Mortgage <= 141.50
|   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |   |--- ID <= 675.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ID >  675.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- CreditCard >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  141.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Income >  116.50
|   |   |   |   |   |   |--- weights: [0.00, 45.05] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.95, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- ID <= 4505.50
|   |   |   |   |   |   |--- CCAvg <= 1.95
|   |   |   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- CCAvg >  1.95
|   |   |   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |   |   |--- ID <= 3239.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |   |   |--- ID >  3239.00
|   |   |   |   |   |   |   |   |   |--- ID <= 4146.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ID >  4146.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- ID >  4505.50
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 52.70] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [3.90, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |   |   |--- ID <= 4176.00
|   |   |   |   |   |   |   |   |   |--- Age <= 35.00
|   |   |   |   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |   |--- Age >  35.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.25] class: 1
|   |   |   |   |   |   |   |   |--- ID >  4176.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- weights: [0.15, 11.90] class: 1
|   |   |   |   |--- Age >  57.00
|   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |--- Income >  113.50
|   |   |   |--- Age <= 66.00
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.50
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.50
|   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 130.90] class: 1
|   |   |   |--- Age >  66.00
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
In [109]:
import pandas as pd

# Importance of features in the tree building (The importance of a feature is computed as the normalized total reduction of the criterion brought by that feature. It is also known as the Gini importance)

feature_importance_df = pd.DataFrame(estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns)
sorted_feature_importance_df = feature_importance_df.sort_values(by="Imp", ascending=False)
print(sorted_feature_importance_df)
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 178, Finished, Available)
Imp
Income              0.593704
Education_2         0.136801
CCAvg               0.078498
Education_3         0.066939
Family              0.065630
ID                  0.016482
Age                 0.015917
CD_Account          0.011009
Securities_Account  0.004589
Mortgage            0.003723
ZIPCode_91          0.003320
ZIPCode_93          0.002744
CreditCard          0.000646
Online              0.000000
ZIPCode_92          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
In [110]:
importances = estimator_2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 179, Finished, Available)

Checking performance on test data

In [ ]:
from sklearn.metrics import confusion_matrix

# Assuming y_test and y_pred are the true labels and predicted labels on the test data
confusion_matrix_sklearn = confusion_matrix(y_test, y_pred)
In [ ]:
decision_tree_tune_post_test = model_performance_classification_sklearn(y_test, y_pred, sorted_feature_importance_df)
decision_tree_tune_post_test

Model Comparison and Final Model Selection¶

In [113]:
# training performance comparison

models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, decision_tree_tune_perf_train.T], axis=1,
)
models_train_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Pre-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 185, Finished, Available)
Out[113]:
Decision Tree sklearn Decision Tree (Pre-Pruning)
Accuracy 1.0 1.0
Recall 1.0 1.0
Precision 1.0 1.0
F1 1.0 1.0
In [115]:
models_test_comp_df = pd.concat(
    [y_test.value_counts(normalize=True), y_test.value_counts(normalize=True)], axis=1
)
models_test_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Pre-Pruning)"]
print("Test performance comparison:")
models_test_comp_df
StatementMeta(, e0fb3f97-35a5-4012-9e11-34fa33017b82, 188, Finished, Cancelled)
Out[115]:
Decision Tree sklearn Decision Tree (Pre-Pruning)
Personal_Loan
0 0.900667 0.900667
1 0.099333 0.099333

Actionable Insights and Business Recommendations¶

  • What recommedations would you suggest to the bank?

Key Points:¶

Model Performance Comparison: The models have similar test performance based on the true label proportions in the test set. Training Performance: Both models do well on the training data, with similar metrics.

Model Choice: Both models predict similarly based on test results. Think about other factors like interpretability and speed to pick one of them. Extra Testing: Do more tests or experiments to see how these models handle new data and how well they work in real situations. Feature Impact Analysis: Examine feature importance more to know which variables affect model predictions the most and possibly improve feature selection methods.

Recommendations for AllLife Bank:¶

1.Create new features like customer demographics, transaction patterns, and account behavior to capture potential loan customers' characteristics effectively.

2.Conduct in-depth analysis to identify key factors influencing customers' decision-making process regarding personal loans.

3.Build a predictive model using machine learning algorithms like logistic regression, decision trees, or random forests to predict the likelihood of a liability customer purchasing a personal loan.

4.Evaluate the model's performance using metrics like accuracy, precision, recall, and F1-score to ensure its effectiveness in identifying potential loan customers.

  1. Utilize the model predictions to target marketing campaigns towards customers with a higher probability of purchasing a personal loan.

  2. Segment customers based on their predicted likelihood of purchasing a loan to tailor marketing strategies and offers accordingly.

7.Regularly monitor and update the model with new data to ensure its relevance and accuracy in predicting potential loan customers.

By implementing these recommendations, AllLife Bank can enhance its marketing strategies, improve customer targeting, and increase the conversion rate of liability customers to personal loan customers, ultimately driving business growth and profitability.